In this paper I’ll introduce the idea of using the Kullback–Leibler divergence (KLD) to estimate a \(50\%\) power region for the log-likelihood ratio test (LLRT) of pure types versus mixture types over the parameter set.
In the first section I’ll explore the mathematical aspects of the KLD and explain the estimation method. In the second section I’ll present some interesting findings and explain how we can use them as guidelines to test the research’s main hypothesis. Conclusions and future work are discussed in the final section.
KLD can be described as the expectation of the log likelihood ratio between two probabilistic models, taken with respect to one of them: \[ KLD = \int log(\frac{f_1(x)}{f_0(x)})f_1(x)dx \] where \(f_1(x)\) denotes the density under the alternative hypothesis and \(f_0(x)\) the density under the null hypothesis. In the context of our research these are the mixture model and the pure types respectively. We can think of this quantity as a measure of the “distance” between the two models, i.e. when the alternative hypothesis is the correct model the KLD grows, while if the null model is the right one the KLD will be close to zero.
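To make the definition concrete, here is a small sketch that evaluates the KLD integral numerically. The two densities are purely illustrative (a shifted normal against a standard normal), not the paper’s mixture and pure-type models; for this pair the KLD has the well-known closed form \((\mu_1-\mu_0)^2/2\) when both standard deviations equal 1, so we can check the numeric answer against it.

```r
# Illustrative densities (NOT the paper's models):
# f1 = N(0.5, 1) plays the alternative, f0 = N(0, 1) plays the null.
f1 <- function(x) dnorm(x, mean = 0.5, sd = 1)
f0 <- function(x) dnorm(x, mean = 0, sd = 1)

# KLD = E_{f1}[ log(f1/f0) ], computed by numeric integration
kld_numeric <- integrate(function(x) log(f1(x) / f0(x)) * f1(x),
                         lower = -Inf, upper = Inf)$value

# Closed form for two equal-variance normals: (mu1 - mu0)^2 / 2 = 0.125
kld_closed <- 0.5^2 / 2

kld_numeric
kld_closed
```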
Since we’ve described the KLD as the expected value (mean) of the log likelihood ratio, we can use MC simulation to estimate it. First we sample observations from the mixture model (the alternative hypothesis). Next we compute the likelihood ratio for each observation by evaluating its likelihood under the alternative and under the null hypothesis: \[ \frac{f_1(x)}{f_0(x)} \] Taking logs and summing over the observations yields the log-likelihood ratio: \[ \sum_{i=1}^{n} log(\frac{f_1(x_i)}{f_0(x_i)}) \] Finally we compute the mean: \[ \hat{KLD} = \frac{1}{n}\sum_{i=1}^{n} log(\frac{f_1(x_i)}{f_0(x_i)}) \]
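The estimation steps above can be sketched as follows, again with an illustrative normal pair standing in for the mixture and pure-type models (for this pair the true KLD is \(0.125\)):

```r
set.seed(1)

# Illustrative stand-ins for the paper's models:
# f1 = N(0.5, 1) is the alternative, f0 = N(0, 1) is the null.
f1 <- function(x) dnorm(x, mean = 0.5)
f0 <- function(x) dnorm(x, mean = 0)

# Step 1: sample from the alternative model f1
n <- 1e5
x <- rnorm(n, mean = 0.5)

# Steps 2-3: log likelihood ratio per observation, then the mean
KLD_hat <- mean(log(f1(x) / f0(x)))

KLD_hat  # should be close to the true value 0.125
```

The MC standard error here is about \(0.5/\sqrt{n}\approx 0.0016\), so the estimate lands very close to the true value.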
In this part, we’ll describe the motivation for using the KLD to partition the parameter set into a “potential discovery region” and its complement. FINISH THIS!

## Results
Since we’ve used an MC simulation to estimate the KLD, let’s first validate the simulation. We know that the population should have mean 0 and variance 1.
mean(KLD$pop_mean)
## [1] 0.0001043389
sd(KLD$pop_mean)
## [1] 0.01979394
hist(KLD$pop_mean, freq = F, breaks = 100, main = "Histogram of population mean")
And repeating the process with the population variance:
mean(KLD$pop_var)
## [1] 1.000107
sd(KLD$pop_var)
## [1] 0.05180318
hist(KLD$pop_var, freq = F, breaks = 100, main = "Histogram of population variance")
range(KLD$pop_var)
## [1] 0.4015822 1.8155557
Let’s analyze the population variance.
which.max(KLD$pop_var)
## [1] 81856
max_var_data <- subset(KLD, KLD$pop_var == max(KLD$pop_var), c(xi, epsilon,delta,male_var))
with(max_var_data, male_var + (xi + epsilon)*(xi + delta))
## [1] 1
and we can see that the numeric solution yields the expected population variance of 1.
In this section I’ll present an analysis of the KLD computation described above. We’ll begin with a simple histogram of the estimated KLD (the 2.5 threshold is marked with a vertical red line).
range(KLD$KLD_pop)
## [1] -0.003915761 9.255500926
hist(KLD$KLD_pop, freq = F, breaks = 100, main = "Histogram of KLD estimation")
abline(v = 2.5, col = "red")
As we can see in the following table, about 17.6% of the parameter set lies in the “discovery region”:
library(dplyr)
cut(KLD$KLD_pop, breaks = c(0,2.5,Inf)) %>%
table() %>%
prop.table()
## .
## (0,2.5] (2.5,Inf]
## 0.8236447 0.1763553
Next we plot \(\xi\) vs KLD:
with(KLD, plot(xi, KLD_pop))
Next we create a surface plot of KLD as a function of \(\xi, \varepsilon, \delta\).
library(plotly)
library(reshape)
# fix xi at 0.1 (inside subset(), xi refers to the KLD column, not a global)
dat <- subset(KLD, xi == 0.1, c(epsilon, delta, KLD_pop))
point_plot_KLD <- plot_ly(dat, x = ~epsilon, y = ~delta, z = ~KLD_pop) %>% add_markers()
point_plot_KLD
point_plot_KLD <- plot_ly(KLD, x = ~epsilon, y = ~delta, z = ~KLD_pop) %>% add_markers()
point_plot_KLD
Next, we’d like to see the maximal value of the KLD as a function of \(\xi\):
KLD_xi <- plot_ly(KLD, x = ~xi, y = ~epsilon, z = ~KLD_pop) %>% add_markers()
KLD_xi
KLD_means <- plot_ly(KLD,
                     x = ~(xi + epsilon),
                     y = ~(-xi - delta),
                     z = ~KLD_pop) %>%
  add_markers()
KLD_means